Import Libraries

Load in the Data using Pandas

View the Data in Pandas

Explore the Data in Pandas

Frequencies

Get counts of the values of the categorical variables to determine whether they can be dummy coded for the regression, should that be needed
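As a sketch of this step (the `data` frame and column names here are hypothetical stand-ins for the car dataset), `value_counts` returns the frequency of each category:

```python
import pandas as pd

# Hypothetical stand-in for the car dataset used in this notebook
data = pd.DataFrame({
    "Brand": ["BMW", "Audi", "BMW", "Toyota", "BMW"],
    "Engine Type": ["Petrol", "Diesel", "Petrol", "Petrol", "Gas"],
})

# Count how many times each category appears in the column
brand_counts = data["Brand"].value_counts()
print(brand_counts)
```

A column with only a handful of distinct values is a good candidate for dummy coding; one with hundreds of distinct values is not.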

The engine column has too many unique values; keep this in mind in case this categorical data needs to be recoded later on

The car model column also has too many unique values; if this categorical data needed to be recoded later on, that would require too many dummies

we can view bar charts to visualize the value counts

Descriptive Statistics

Run descriptive statistics in Python to learn more about the data

we can easily get descriptive statistics on anything that is numeric by calling the .describe() function on the dataset and variable name

This shows the descriptive statistics for the numeric variables.

The counts for the numeric variables differ, which shows that some of the data in these variables may be missing (Price and EngineV)

(Update from the study guide provided by Dr. MO) Let's observe the non-numeric variables using the describe function

use the 'include' parameter of the describe function to view all columns of the data, not just the numeric ones
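A minimal sketch of this step, on a hypothetical frame mixing numeric and categorical columns:

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column
data = pd.DataFrame({
    "Price": [4200.0, 7900.0, 13500.0, None],
    "Brand": ["BMW", "Audi", "BMW", "Toyota"],
})

# By default describe() summarizes only the numeric columns;
# include='all' adds count/unique/top/freq for the categoricals too
summary = data.describe(include="all")
print(summary)
```

Note that the `count` row also reveals missing values: a column's count is lower than the number of rows whenever some entries are NaN.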

As noted above, the Model column has too many unique models and would require too many dummies in order to be used as a categorical predictor in the regression model - it will not be a variable of interest, so it can be dropped from the dataset

Declare a new variable to save the updated dataset to once we drop the Model column, so the changes are preserved

the 'axis=1' tells pandas to apply the drop to columns rather than rows
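A quick sketch of the drop (variable names are hypothetical):

```python
import pandas as pd

data = pd.DataFrame({
    "Brand": ["BMW", "Audi"],
    "Model": ["320", "A4"],
    "Price": [4200.0, 7900.0],
})

# axis=1 drops a column; axis=0 would drop a row label.
# Assigning to a new variable keeps the original frame intact.
data_no_model = data.drop(["Model"], axis=1)
print(data_no_model.columns.tolist())
```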

Now let's take a look at the missing values we described earlier

Let's see just how many missing values there are in this dataset
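One common way to count missing values per column (shown here on a small hypothetical frame) is to chain `isnull()` and `sum()`:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    "Price": [4200.0, np.nan, 13500.0],
    "EngineV": [2.0, 3.0, np.nan],
    "Year": [2012, 2015, 2016],
})

# isnull() marks each missing cell as True; sum() totals them per column
missing_per_column = data.isnull().sum()
print(missing_per_column)
```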

There are 172 rows missing data in the Price column and 150 rows missing data in the EngineV column. These rows will need to be removed

Remove the missing Values

For the 'dropna' function, the axis parameter needs to be set to either 0 - to remove rows - or 1 - to remove columns. Since we are not dropping a column, just the missing data, we will remove the rows containing the missing values
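A minimal sketch of row-wise deletion (the frame below is hypothetical):

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    "Price": [4200.0, np.nan, 13500.0],
    "EngineV": [2.0, 3.0, np.nan],
})

# axis=0 removes any row containing a missing value,
# so the columns themselves are preserved
data_no_mv = data.dropna(axis=0)
print(len(data_no_mv))
```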

Listwise deletion can be dangerous, as too much data could be removed to perform a complete and accurate analysis. It is used here since not much of the data appears to be missing.

we can take a look to make sure all of the missing data has been dropped

the count for each column is now equal, and the Model column is now gone, as it had too many unique instances

We can draw a few observations now from exploring the data

The cheapest car in the dataset is 600, and the most expensive is 300,000. The year of the most expensive car is 2016; the year of the least expensive car is 1969. Does this mean that year is the most relevant feature for predicting car price in our model? Could be!

Assumptions of Linear Regression

Assumptions are important when checking how accurate the results from the statistical analysis will be. Failing an assumption or two is okay to move forward; however, it is important to always note when an assumption has failed when using the results to make decisions

Testing for Linearity and Normality

Now that the data has been cleaned of missing values, we can try to get a feel for its distribution. A picture is truly worth a thousand words, so let's visualize the data and use those visualizations to build a narrative of the data.

First let's take a look at histograms for all of the data

It is always useful to plot data to observe if a relationship can be found

These scatterplots show the relationships between the variables, and we can take note of the outliers present in the various graphs. We can also check whether any of the relationships between x and y appear linear.

Now let's take a look at a histogram for our dependent variable; we can use seaborn to apply a best-fit line to the histogram

The data looks heavily right-skewed here, and according to the describe output run earlier, the Price column does have outliers: the mean was about 19,552, while the values range from a minimum of 600 all the way up to 300,000, so the extreme values are likely outliers.

Since the data does not follow a normal distribution, we can find the outliers using the quantile method in pandas. We can create a function that outputs the outliers, using the 1.5 x IQR rule
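One possible version of such a function (the function name and sample prices are hypothetical), flagging anything beyond 1.5 interquartile ranges from the quartiles:

```python
import pandas as pd

def find_outliers(series):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical price values, with one extreme at each end
prices = pd.Series([600, 7000, 8000, 9000, 10000, 11000, 300000])
outliers = find_outliers(prices)
print(len(outliers))
```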

run the function on the price column in the dataset to see how many outliers there are

There are 355 outliers according to the function we created. We could drop them all, potentially losing a lot of data, or lose less by dropping only the top 1% and keeping all of the data within the 99th percentile

we can define a variable that will hold the 99th-percentile value of the data

define another variable that will store only the rows where Price is less than the 99th percentile
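The two steps together look something like this (the evenly spaced prices and variable names here are hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical price column: 200 evenly spaced values from 100 to 20000
data = pd.DataFrame({"Price": np.arange(1, 201) * 100.0})

# Hold the 99th-percentile value, then keep only rows below it
q99 = data["Price"].quantile(0.99)
data_1 = data[data["Price"] < q99]
print(data_1["Price"].max())
```

This trims only the extreme top of the distribution while keeping 99% of the rows.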

view the descriptive statistics on the new stored data

The max values in the dataset are now lower without some of the outliers, giving us slightly more accurate data to work with

we can take a look at the histogram again with a best fit line to see the transformed data

the data appears more evenly distributed without the outliers. Let's repeat the same for the independent variables

there are only 36 outliers in this dataset.

The data for mileage looks closer to being normally distributed after dropping the outliers

This is heavily skewed data. We can take a closer look at the data column to examine what should be dropped

The difference between the min and max values suggests that something is off. Looking at the data, we can see that most values are fairly low and do not range above 6.5; the higher values are outliers. We can get rid of all values above 6.5.
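A sketch of that filter (the sample values are hypothetical, and 6.5 is the cutoff assumed here since larger engine volumes are implausible for passenger cars):

```python
import pandas as pd

# Hypothetical engine volumes, including one clearly erroneous entry
raw = pd.DataFrame({"EngineV": [1.6, 2.0, 3.0, 4.4, 6.5, 99.99]})

# Keep only plausible engine volumes; anything above 6.5 is treated
# as bad data rather than a real measurement
data_3 = raw[raw["EngineV"] <= 6.5]
print(len(data_3))
```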

this data looks closer to being normally distributed

let's take a look at the year data

looking at the outliers in this column, they all appear to be older cars from 1989 and earlier

to avoid losing too much data, we can drop only the bottom 1% and keep 99% of the data to improve model accuracy

drop the 1% and keep the rest of the data
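Since the outliers sit at the low end here, the filter keeps everything above the 1st percentile (the repeated-years frame below is hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical year column: each year from 1969 to 2016, three times
data = pd.DataFrame({"Year": np.repeat(np.arange(1969, 2017), 3)})

# The outliers are the oldest cars, so drop the bottom 1%
# and keep everything above the 1st percentile
q01 = data["Year"].quantile(0.01)
data_4 = data[data["Year"] > q01]
print(data_4["Year"].min())
```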

some of the outliers were removed, improving the accuracy of the data to be used in the model

To use only the data that is useful for the analysis, we can reset the index
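After dropping rows, the index has gaps; `reset_index(drop=True)` renumbers it cleanly (the small frame here is hypothetical):

```python
import pandas as pd

# A frame whose index has gaps left over from dropped rows
data = pd.DataFrame({"Price": [4200.0, 7900.0, 13500.0]}, index=[0, 3, 7])

# drop=True discards the old, gappy index instead of keeping it as a column
data_cleaned = data.reset_index(drop=True)
print(data_cleaned.index.tolist())
```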

we can now view the descriptive statistics on the cleaned data, free of most outliers for improved data accuracy

Check for linear relationships between the independent variables and the dependent variable.

run residual plots to print graphs that show the relationships

None of the scatter plots show a linear relationship between the x and y values. This may be due to one of the variables not being normally distributed. Let's transform the data: from the histograms earlier, the Price histogram was the one furthest from a normal distribution, so let's transform that variable
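A log transform is a common choice for a right-skewed, strictly positive variable like price; a sketch (column names hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical right-skewed price values
data = pd.DataFrame({"Price": [600.0, 7000.0, 20000.0, 120000.0]})

# The log transform compresses the long right tail of a skewed variable
data["log_price"] = np.log(data["Price"])
print(data["log_price"].round(2).tolist())
```

Because the log is monotonic, the ordering of prices is preserved; only the spacing between them changes.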

now that the price data has been transformed, let's rerun the scatterplots

we can now see linear patterns in the scatter plots for all of the variables.

we can drop the original Price column, since we can rely on the transformed data in the new log price column

Testing for Multicollinearity

Create Dummy Variables
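`pd.get_dummies` handles this step; the frame below is hypothetical. `drop_first=True` omits one category per variable, which serves as the baseline and avoids perfect multicollinearity among the dummies:

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
data = pd.DataFrame({
    "Brand": ["BMW", "Audi", "Toyota"],
    "Price": [9.1, 8.9, 8.5],
})

# One dummy per remaining category; the dropped category (here Audi,
# the alphabetically first) becomes the implicit baseline
data_with_dummies = pd.get_dummies(data, drop_first=True)
print(data_with_dummies.columns.tolist())
```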

Rearrange the columns

Create the base model

assign the variables to x and y, since statsmodels cannot take full data frames directly

The model explains about 75% of the variance in the dependent variable (R-squared of roughly 0.75)